Massive Data Sets: Visualization and Analysis - Statement of Problem Studied
1.  Introduction 

The  U.S.  Army  faces  radical  changes  in  its  operations  over  the  next  decade.  The  continuing 
downsizing  and  the  changing  nature  of  ground  warfare  threats  imply  an  increasing  reliance  on 
technology.  Two  principal  examples  of  this  are  the  emergence  of  an  enlarged  engagement  theatre 
and  the  consequent  emphasis  on  theatre  level  anti-air  and  antimissile  defense  (THAAD)  and  the 
increasing  emphasis  on  information  warfare.  Both  of  these  technology  enhancements  imply  the 
need  to  deal  with  the  rapid  analyses  of  massive  data  sets  and  subsequent  decision  making  based 
on  these  analyses. 

The  THAAD  theatre-level  defense  system  is  based  on  the  premise  that  enemy  forces  will 
have  increased  capabilities  to  wage  air  and  missile  warfare  against  ground  forces  as  was 
evidenced  by  the  capabilities  of  the  Iraqis  with  their  SCUD  missiles  in  the  Gulf  War. 
Consequently, there is a need to provide theatre-level defense. The THAAD system is a radar-based,
two-tier anti-air, antimissile defense system. THAAD itself provides a high-level, upper-tier
defense which is intended to eliminate most of the incoming threats. Patriot is intended as a
lower-level, lower-tier supplementary defense to eliminate remaining threats. Because of the high
information  content  supplied  by  the  detection  and  interception  electronics,  rapid  processing  is 
required.  The  incoming  missile  systems  are  likely  to  have  penetration  aids  (PENAIDS)  supplying 
false  targets  to  the  radars  and  other  target  detection  systems  such  as  IR.  Incoming  planes  are 
likely  to  have  their  own  electronic  countermeasures  in  the  form  of  standoff  jammers.  They  are 
also  likely  to  have  anti-radiation  missiles  (ARM)  which  would  ride  the  radar  beams  down  to  the 
friendly forces' radar antennas. Incoming tactical ballistic missiles are likely to have missile-borne
jammers (TBMBJ). In addition, there are electromagnetic environmental effects (E3) such as
lightning, EMP, and co-site interference. There is also the threat of standoff tactical nuclear strikes
which, while not targeted in the theatre, could have significant EMP and initial nuclear radiation
effects on both the defense system and soldier survivability. All of these factors imply an
extremely  intensive  classification  and  discrimination  load  arising  from  sensor  collected  data. 

While  THAAD  is  oriented  to  a  comparatively  limited  theatre  of  engagement,  information 
warfare  is  a  global,  increasingly  important  threat.  The  modern  Army  runs  on  information  as  much 
as  it  runs  on  fuel,  weapons,  and  soldiers.  Attacks  against  information  systems  and  C  systems  can 
destroy  the  Army's  ability  to  project  force  and  wage  war  as  effectively  as  any  weapons  system. 
This  is  particularly  the  case  since  much  of  the  Army's  information  technology  relies  on  COTS 
hardware  and  software  systems,  perhaps  somewhat  modified  for  increased  physical  survivability. 
Moreover, the majority of the Army's communications system traffic is carried over the
commercial network infrastructure. These facts imply that Army information systems and computers
are  subject  to  the  same  viruses,  worms,  trojan  horses,  trap  doors  and  other  hacker  threats  that 
commercial  machines  are  subject  to.  In  a  GAO  report  dated  8  May  1996,  the  results  of  a  DISA 
vulnerability  assessment  are  published.  (See  also  “Information  Systems  Threat  Assessment  (U)”, 
DIA  PC-1750-4-93,  dtd  November  93.)  DISA  carried  out  16,840  attacks  on  DoD  systems.  Of  that 
total,  14,819  attacks  were  successful,  only  593  attacks  were  detected,  and  only  30  of  those  were 
reported. This is a reported attack rate of 0.178%. DISA estimates there are some 250,000 attacks
annually  on  DoD  computers.  Again,  it  is  clear  that  network  traffic  flow  represents  an  enormous 
information  database  and  that  statistical  clustering  and  discrimination  techniques  are  key 
elements  to  detecting  and  tracking  information  and  C  attacks. 
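
The rates implied by the DISA figures can be checked with a few lines of arithmetic. The Python sketch below is an illustration only. The variable names are ours, and it assumes that the 593 detections are counted against the successful attacks, which the text does not state explicitly.

```python
# Figures quoted above from the DISA vulnerability assessment.
attacks = 16_840      # attacks carried out by DISA
successful = 14_819   # attacks that succeeded
detected = 593        # attacks that were detected
reported = 30         # detected attacks that were reported


def pct(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


success_rate = pct(successful, attacks)      # share of attacks that succeeded
detection_rate = pct(detected, successful)   # assumed base: successful attacks
reported_rate = pct(reported, attacks)       # the rate quoted in the text

print(f"successful: {success_rate:.1f}% of all attacks")
print(f"detected:   {detection_rate:.1f}% of successful attacks")
print(f"reported:   {reported_rate:.3f}% of all attacks")
```

Under these assumptions roughly 88% of the attacks succeeded, about 4% of the successful attacks were detected, and the reported rate is the 0.178% quoted above.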

Motivated by these two examples, I am proposing to conduct research on certain aspects of
the analysis of massive data sets, particularly focusing on certain algorithmic and visualization
issues. The proposal is organized as follows. Section 2 contains a discussion of the computational
and visualization impact of massive data sets. Section 3 contains several closely related proposed
topics for research: a) clustering complexity, b) visualization complexity, and c) effects of
quantization of data in massive data sets. Section 4 details the results of previous support
including publications and interactions with Army personnel and commands.

2.  Impact  of  Massive  Data  Sets 

Wegman  (1995)  has  outlined  a  taxonomy  of  data  set  sizes  and  discussed  the  implication  of 
large  data  sets  in  terms  of  computational  complexity  and  visualization  limits.  He  points  out  that 
complexity in both computation and visualization increases not linearly in the sample size, but
more nearly in proportion to the order of magnitude of the data set size.

Traditional  statistical  methods  usually  focus  on  data  set  sizes  in  the  range  characterized  as 
tiny  to  small  (up  to  10,000  observations).  Indeed,  even  modern  exploratory  data  analysis 
techniques  rarely  consider  data  set  sizes  larger  than  those  characterized  as  medium  (1,000,000 
observations).  And  yet,  as  we  have  seen  in  the  case  of  THAAD  and  information  warfare 
scenarios, it is likely that huge or massive data sets (on the order of 10^10 observations) would be involved.
Indeed,  much  of  the  data  would  require  real-time  processing  to  be  effective  in  these  military 
scenarios.  Wegman  (1995)  analyzes  the  complexity  of  a  number  of  algorithms  and  concludes  that 
algorithms of quadratic order, O(n^2), may not be computationally feasible. Consider for example an O(n^2) complexity
algorithm and a huge data set of n = 10^10 observations, which requires on the order of 10^20 operations. The most
ambitious supercomputer goals announced are the teraflop computers (10^12 operations per second), which would
take 10^8 seconds, or about 3.17 years, to compute this case. Clustering algorithms (cf.
Everitt, 1993) are usually distance-based, hence for clustering n points, they require on the order of n^2 distance
computations. Thus it is computationally infeasible to use conventional clustering algorithms for
huge data sets even with teraflop computers.
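
The feasibility argument above is simple arithmetic and can be reproduced as follows. This Python sketch is illustrative only. It assumes an idealized machine that completes exactly one algorithmic operation per floating-point operation at a sustained 10^12 operations per second, and the function and variable names are ours.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, about 3.156e7 seconds


def runtime_seconds(operations, ops_per_second=1e12):
    """Idealized run time on a teraflop machine (one operation per flop)."""
    return operations / ops_per_second


n = 10**10  # a huge data set

linear = runtime_seconds(n)                   # O(n)
n_log_n = runtime_seconds(n * math.log2(n))   # O(n log n)
quadratic = runtime_seconds(n**2)             # O(n^2), e.g. all pairwise distances
quadratic_years = quadratic / SECONDS_PER_YEAR

print(f"O(n):       {linear:.2f} s")
print(f"O(n log n): {n_log_n:.2f} s")
print(f"O(n^2):     {quadratic:.3g} s = {quadratic_years:.2f} years")
```

On this idealized machine the linear and n log n cases finish in well under a second, while the quadratic case takes about 3.17 years. This gap is why algorithms of lower complexity are sought for clustering.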

From  the  visualization  point  of  view,  it  is  clear  that  in  dealing  with  large  to  massive  data  sets, 
conventional methods become problematic. Consider for example a data set of size 10^10. The 1%
outliers themselves amount to 10^8 observations. Suppose that we could invent an extremely
efficient encoding of the data, say one pixel per data item. This is sometimes called a scatter plot.
The question is, with the resolution of the normal human eye, how many pixels could we see?
Wegman (1995) suggests that even under the most wildly optimistic scenario we are unlikely to
be able to visualize more than 10^7 observations.

From the discussion of Section 1, it is clear that the Army faces significant operational issues
that  depend  on  the  ability  to  analyze  and  visualize  massive  data  sets.  From  the  discussion  just 
given  in  Section  2,  it  is  clear  that  real-time  analysis  and  visualization  of  massive  data  sets  is  a 
nontrivial  problem. 

3.  Research  Tasks  and  Results 

Clustering  Complexity  -  Clustering  is  probably  the  single  most  important  problem  in 
discovering  structure  in  data,  i.e.  in  contemporary  language,  the  most  important  data  mining 
issue.  Cluster  analysis,  while  comparatively  hard  to  define,  refers  to  a  process  of  dividing  a  data 
set  into  relatively  homogeneous  subsets  where  a  priori  the  number  and  nature  of  the  subsets  is 
unknown.  Classification,  ordinarily  viewed  as  an  easier  problem,  refers  to  the  association  of  data 
points with predefined groups or subsets of data. Often in classification, there is a training data set in
which  the  clusters  or  subsets  are  known  by  some  external  criterion.  An  adaptive  procedure  is 
developed based on the training set which classifies new data into the clusters discovered in the
training  data.  This  is  sometimes  called  supervised  learning.  Unsupervised  learning  is 
accomplished  when  there  is  no  training  data. 

I proposed to examine density-based methods for clustering. In contrast with distance-based
methods, most conventional nonparametric density estimators have a complexity of O(n),
although for multivariate data the multiplier subsumed in O(n) may be large. Thus if a suitable
density-based clustering algorithm can be developed, the computational complexity is likely to be
reduced from O(n^2) to O(n). A number of successful attacks on this problem have been made.
Items 2, 3, 14, 17, 19, 23 and 53 in the publication list below refer to nonparametric density
estimation and issues of computational complexity. In addition, Ph.D. dissertations have been
completed by my student Amrut Champaneri, entitled: Multivariate Probability Density
Estimation: Some Statistical Properties, and by my student Sung Ahn, entitled: A Maximum
Likelihood Method for Density Estimation. In the former dissertation, Dr. Champaneri created a
computational algorithm for tessellating multidimensional space with complexity O(n log n). This
tessellation not only leads to a histogram-like maximum likelihood estimator, but also leads to a
natural clustering algorithm of complexity O(n log n). Dr. Champaneri also showed consistency
and asymptotic distribution results based on an empirical determination of the creation rate of
tiles as a function of dimensions and the number of tessellating points. Dr. Ahn’s work focused
on adaptive normal mixture models for nonparametric density estimation and created a Bayesian
penalty  function  for  creation  of  too  many  mixture  terms.  The  adaptive  normal  mixture 
methodology  identifies  clusters  by  identifying  mixture  terms.  The  adaptive  mixture  methodology 
is  a  recursive  technique  that  requires  one  pass  through  the  data. 
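
To make the complexity argument concrete, the following Python sketch clusters two-dimensional points through a density estimate rather than pairwise distances. It is not the tessellation algorithm or the adaptive mixture method described above. As a stand-in it uses a fixed-grid histogram, which is our assumption, and the function name, cell size and threshold are hypothetical. Counting takes one pass over the data, and clusters are formed by joining adjacent dense cells.

```python
from collections import Counter, deque


def grid_density_clusters(points, cell=0.5, min_count=5):
    """Label 2-D points by connected regions of dense histogram cells.

    Returns (labels, n_clusters); points in sparse cells get label -1.
    """
    def cell_of(x, y):
        return (int(x // cell), int(y // cell))

    # One pass over the data: O(n) histogram counts.
    counts = Counter(cell_of(x, y) for x, y in points)
    dense = {c for c, k in counts.items() if k >= min_count}

    # Join dense cells that touch (8-neighbourhood) by breadth-first search.
    cell_label = {}
    n_clusters = 0
    for start in dense:
        if start in cell_label:
            continue
        cell_label[start] = n_clusters
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in cell_label:
                        cell_label[nb] = n_clusters
                        queue.append(nb)
        n_clusters += 1

    # Second pass: each point inherits the label of its cell.
    labels = [cell_label.get(cell_of(x, y), -1) for x, y in points]
    return labels, n_clusters


def lattice(x0, y0, side=20, step=0.1):
    """Deterministic square patch of side*side points (test data only)."""
    return [(x0 + (i + 0.5) * step, y0 + (j + 0.5) * step)
            for i in range(side) for j in range(side)]


points = lattice(0.0, 0.0) + lattice(10.0, 10.0) + [(5.0, 5.0), (-3.2, 7.1)]
labels, n_clusters = grid_density_clusters(points)
print(n_clusters, labels[0], labels[400], labels[-1])
```

The work is proportional to the number of points plus the number of occupied cells, so no pairwise distance matrix is ever formed. A fixed grid scales poorly with dimension, which is one reason adaptive tessellations and mixture models are of interest.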

Visualization  Complexity  -  The  problem  of  visualizing  large  data  sets  is  a  vexing  one.  As 
indicated above, the standard high-resolution screen has about 10^6 pixels, so that at best we could
hope to represent 10^6 observations. Even if more pixels were available, the ability of the eye to
distinguish  pixels  is  limited  by  the  distance  between  foveal  cones  within  the  eye.  Alternative 
strategies  have  to  be  discovered.  I  have  advocated  the  use  of  immersive  techniques  (virtual 
reality)  and  three-dimensional  techniques  in  the  past.  Much  of  our  previous  Army  sponsored 
research  has  focused  on  these  techniques.  The  reason  for  using  these  techniques  is  that  the  third 
dimension moves from a pixel to a voxel setting which potentially moves us from 10^6 pixels to
10^9 voxels. This gives us three orders of magnitude extra “screen real estate.”


Our  visualization  work  has  focused  on  methods  for  expanding  the  scope  of  data  that  may  be 
visualized.  In  part  this  is  closely  aligned  with  density  methods  discussed  in  the  previous  section 
because  densities  are  a  representation  of  data  when  the  overplotting  is  too  severe.  Items  2, ,  7,  11, 
12,  13,  14,  15,  16,  18,  19,  21,  24,  25,  26,  27,  28,  34,  35,  39,  40,  41,  42,  44,  45,  46,  47,  50,  51,  52, 
54,  55,  66,  69,  70,  and  71  are  all  items  related  to  data  visualization  and  computer  graphics. 
Perhaps  the  most  important  results  here  fall  into  two  categories:  1)  visual  data  mining  and  2)  low 
cost  immersive  environments.  In  the  former  category  we  have  combined  four  basic  techniques:  a) 
parallel  coordinates,  b)  grand  tour,  c)  saturation  brushing  and  d)  stereoscopic  displays  to  provide 
integrated  techniques  for  large  scale  data  analysis.  Visual  data  mining  has  been  demonstrated  on 
several  data  sets  of  quite  large  magnitude  e.g.  130,000  items  in  8  dimensions  and  58,000  items  in 
14  dimensions.  Strategies  we  have  devised  include  what  we  have  called  a  BRUSH-TOUR 
strategy for high dimensional clustering and a TOUR-PRUNE strategy for constructing tree-based
decision rules. In the second category, low cost immersive environments, we have constructed
what we have called the MiniCAVE, essentially a PC-based voice-controlled immersive
environment for approximately $20,000. We have recently been notified that a US Patent is being
issued on this system. Also of interest, a Ph.D. dissertation has been completed by my student,
Rida E. A. Moustafa, entitled: Fast Conceptual Clustering Algorithm for Data Mining and
Visualization.
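
The idea behind saturation brushing can be illustrated with a small sketch. The report does not give an implementation, so the Python code below is a generic rendering of the idea under our own assumptions. Each observation adds a small, fixed amount of intensity to the pixel it falls in, and the intensity is clipped at full saturation, so heavily overplotted regions appear saturated while isolated points remain faint. For brevity it rasterizes a scatter plot rather than parallel-coordinate line segments, and the function name and the parameter alpha are hypothetical.

```python
def saturation_brush(points, width, height, bounds, alpha=0.0625):
    """Rasterize points into a height x width grid of intensities in [0, 1].

    Each point adds `alpha` to its pixel; intensities saturate at 1.0.
    `bounds` is (xmin, xmax, ymin, ymax); points outside are skipped.
    """
    xmin, xmax, ymin, ymax = bounds
    image = [[0.0] * width for _ in range(height)]
    for x, y in points:
        if not (xmin <= x <= xmax and ymin <= y <= ymax):
            continue
        # Map data coordinates to pixel indices, keeping the upper edge in range.
        col = min(int((x - xmin) / (xmax - xmin) * width), width - 1)
        row = min(int((y - ymin) / (ymax - ymin) * height), height - 1)
        image[row][col] = min(1.0, image[row][col] + alpha)
    return image


# Ten coincident observations and one isolated observation on a 4 x 4 raster.
points = [(0.5, 0.5)] * 10 + [(0.1, 0.1)]
image = saturation_brush(points, width=4, height=4,
                         bounds=(0.0, 1.0, 0.0, 1.0), alpha=0.25)
print(image[2][2], image[0][0])
```

With alpha = 0.25 a pixel saturates after four observations, so the overplotted pixel reaches full intensity while the isolated point stays at one quarter. With massive data sets alpha would be chosen much smaller so that intensity reflects density over a wider range of counts.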

Quantization  -  Quantization,  roundoff,  and  binning  are  aspects  of  a  similar  process  arising  in 
different disciplines. Quantization is the language normally used in electrical engineering and signal
processing, referring to the discretization of signals. This has been a very successful strategy for
dealing with the digitization of signals, for example with digital audio. Roundoff is of concern in
the numerical analysis community and refers to the truncation of real numbers for purposes of
numerical  computing  in  a  fixed  word  length  computer.  Roundoff  analysis  again  is  a  success  story 
in  the  numerical  analysis  community.  Binning  is  used  sometimes  in  the  statistics  community  and 
refers  to  grouping  data  in  representative  groups.  Often  coupled  with  comparatively  elementary 
notions  like  histograms,  binning,  in  contrast  with  the  other  related  concepts,  does  not  seem  to 
enjoy  a  great  reputation  within  the  statistics  literature.  One  perhaps  significant  difference  between 
binning  and  grouping  in  the  statistics  community  and  quantization  and  roundoff  in  the  other 
communities  is  that  conventionally  binning  and  grouping  have  been  thought  to  be  comparatively 
coarse approximations whereas quantization and roundoff ideas are connected with fine
approximations. For example, quantization of audio is usually done at the 16 bit level while
quantization in images is usually done at the eight to twenty-four bit level, i.e. 256 to about 16.7 million
colors. In contrast, in binning for histograms we often think in terms of 10 to 20 class intervals. I
would conjecture few statisticians have ever thought of constructing a histogram with 16 million
bins. Yet this would not be entirely unreasonable in dealing with a terabyte of data.

There  are  distinct  theoretical  and  computational  advantages  to  binning.  Binning  or 
quantization seems particularly appropriate to settings in which there are huge to massive data sets
because  with  a  data  set  of  this  size,  the  bins  can  be  sufficiently  small  that  they  are  smaller  than  the 
limits  of  perception.  Moreover,  once  quantized  the  storage  requirements  for  the  data  are  likely  to 
be  considerably  smaller  since  the  only  information  required  is  the  bin  and  the  count  of  items  in 
that bin. My Ph.D. student, Martin Khumbah, completed his dissertation, entitled: Mathematical
Quantization for Massive Datasets, on this topic. Among the interesting things we have shown is
that  geometric  quantization  can  be  accomplished  effectively  up  to  about  5  or  6  dimensions  and  in 
this  range  there  is  almost  no  theoretical  loss  associated  with  quantization.  If  the  representors  of 
the  quantized  data  are  chosen  appropriately,  the  quantized  data  is  self-consistent  and  bias  remains 
unchanged  while  variance  is  reduced.  We  developed  results  on  computational  complexity  -  O(n), 
storage  complexity  -  3k,  where  k  is  the  number  of  quantized  regions,  and  strategies  for 
minimizing distortion. Items 49, 56, 62 and 63 refer to this research. Item 62, presented at the
Interface meeting, was selected for the session “Best of the Army Research Office.”
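
The claim that the mean is unchanged while the variance is reduced can be illustrated in one dimension. The Python sketch below is an illustration only and not the geometric quantization developed in the dissertation. It assumes equal-width bins and takes the bin mean as the representor, which is our reading of "chosen appropriately", and the function names are hypothetical. Only one representor and one count are kept per occupied bin.

```python
import random
from collections import defaultdict


def quantize(values, width):
    """One pass, O(n): map each bin index to (representor, count).

    The representor of a bin is the mean of the values that fall in it.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v in values:
        b = int(v // width)
        sums[b] += v
        counts[b] += 1
    return {b: (sums[b] / counts[b], counts[b]) for b in counts}


def quantized_mean_var(table):
    """Mean and (population) variance computed from the quantized data only."""
    n = sum(c for _, c in table.values())
    mean = sum(r * c for r, c in table.values()) / n
    var = sum(c * (r - mean) ** 2 for r, c in table.values()) / n
    return mean, var


rng = random.Random(0)
values = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

raw_mean = sum(values) / len(values)
raw_var = sum((v - raw_mean) ** 2 for v in values) / len(values)

table = quantize(values, width=0.25)
q_mean, q_var = quantized_mean_var(table)

print(len(values), "observations ->", len(table), "bins")
print("mean difference:", abs(raw_mean - q_mean))
print("variance raw vs quantized:", raw_var, q_var)
```

Because each representor is the mean of its bin, the quantized data reproduce the sample mean up to rounding error, and the within-bin variation is removed, so the variance computed from the quantized data cannot exceed the raw variance.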

Summary  of  the  Most  Important  Results 

Density  Estimation  and  Clustering 

•  Developed  a  computational  algorithm  for  tessellating  multidimensional  space  with 
complexity O(n log n).

•  This tessellation not only leads to a histogram-like maximum likelihood estimator,
but also leads to a natural clustering algorithm of complexity O(n log n).

•  Showed  consistency  and  asymptotic  distribution  results  based  on  an  empirical 
determination  of  the  creation  rate  of  tiles  as  a  function  of  dimensions  and  the  number 
of  tessellating  points. 

•  Constructed  an  algorithm  for  adaptive  normal  mixture  models  for  nonparametric 
density  estimation  and  created  a  Bayesian  penalty  function  for  creation  of  too  many 
mixture  terms. 

•  Identified clusters by identifying mixture terms.

•  The  adaptive  mixture  methodology  is  a  recursive  technique  that  requires  one  pass 
through  the  data. 

Visualization 

•  Created  visual  data  mining  strategies  combining  four  basic  techniques:  a)  parallel 
coordinates,  b)  grand  tour,  c)  saturation  brushing  and  d)  stereoscopic  displays  to 
provide  integrated  techniques  for  large  scale  data  analysis. 

•  Visual  data  mining  has  been  demonstrated  on  several  data  sets  of  quite  large 
magnitude  e.g.  130,000  items  in  8  dimensions  and  58,000  items  in  14  dimensions. 

•  Devised what we have called a BRUSH-TOUR strategy for high dimensional
clustering and a TOUR-PRUNE strategy for constructing tree-based decision rules.

•  Created a low cost immersive environment that we have called the MiniCAVE,
essentially a PC-based voice-controlled immersive environment for approximately
$20,000.

•  MiniCAVE is voice controlled and features stereoscopic capability.

•  We  have  recently  been  notified  that  a  US  Patent  is  being  issued  on  our  MiniCAVE 
system. 

Quantization 

•  Demonstrated  a  fast  algorithm  for  quantization  of  massive  datasets. 

•  Demonstrated  that  geometric  quantization  can  be  accomplished  effectively  up  to 
about  5  or  6  dimensions  and  in  this  range  there  is  almost  no  theoretical  loss 
associated  with  quantization. 

•  Showed  that  if  the  representors  of  the  quantized  data  are  chosen  appropriately,  the 
quantized  data  is  self-consistent  and  bias  remains  unchanged  while  variance  is 
reduced. 

•  Developed  results  on  computational  complexity  -  O(n),  storage  complexity  -  3k, 
where k is the number of quantized regions, and strategies for minimizing distortion.